Similarity Search on Bregman Divergence: Towards Non-Metric Indexing
نویسندگان
چکیده
In this paper, we examine the problem of indexing over non-metric distance functions. In particular, we focus on a general class of distance functions, namely Bregman Divergence [6], to support nearest neighbor and range queries. Distance functions such as KL-divergence and Itakura-Saito distance, are special cases of Bregman divergence, with wide applications in statistics, speech recognition and time series analysis among others. Unlike in metric spaces, key properties such as triangle inequality and distance symmetry do not hold for such distance functions. A direct adaptation of existing indexing infrastructure developed for metric spaces is thus not possible. We devise a novel solution to handle this class of distance measures by expanding and mapping points in the original space to a new extended space. Subsequently, we show how state-of-the-art tree-based indexing methods, for low to moderate dimensional datasets, and vector approximation file (VA-file) methods, for high dimensional datasets, can be adapted on this extended space to answer such queries efficiently. Improved distance bounding techniques and distribution-based index optimization are also introduced to improve the performance of query answering and index construction respectively, which can be applied on both the R-trees and VA files. Extensive experiments are conducted to validate our approach on a variety of datasets and a range of Bregman divergence functions.
منابع مشابه
Hardness and Non-Approximability of Bregman Clustering Problems
We prove the computational hardness of three k-clustering problems using an (almost) arbitrary Bregman divergence as dissimilarity measure: (a) The Bregman k-center problem, where the objective is to find a set of centers that minimizes the maximum dissimilarity of any input point towards its closest center, and (b) the Bregman k-diameter problem, where the objective is to minimize the maximum ...
متن کاملEfficient Bregman Range Search
We develop an algorithm for efficient range search when the notion of dissimilarity is given by a Bregman divergence. The range search task is to return all points in a potentially large database that are within some specified distance of a query. It arises in many learning algorithms such as locally-weighted regression, kernel density estimation, neighborhood graph-based algorithms, and in tas...
متن کاملOverlapping clustering based on kernel similarity metric
Producing overlapping schemes is a major issue in clustering. Recent proposed overlapping methods relies on the search of an optimal covering and are based on different metrics, such as Euclidean distance and I-Divergence, used to measure closeness between observations. In this paper, we propose the use of another measure for overlapping clustering based on a kernel similarity metric .We also e...
متن کاملNon-Metric Locality-Sensitive Hashing
Non-metric distances are often more reasonable compared with metric ones in terms of consistency with human perceptions. However, existing locality-sensitive hashing (LSH) algorithms can only support data which are gauged with metrics. In this paper we propose a novel locality-sensitive hashing algorithm targeting such non-metric data. Data in original feature space are embedded into an implici...
متن کاملThe Basic Principles of Metric Indexing
This chapter describes several methods of similarity search, based on metric indexing, in terms of their common, underlying principles. Several approaches to creating lower bounds using the metric axioms are discussed, such as pivoting and compact partitioning with metric ball regions and generalized hyperplanes. Finally, pointers are given for further exploration of the subject, including non-...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- PVLDB
دوره 2 شماره
صفحات -
تاریخ انتشار 2009